GitHub
The code for the process described here can be freely downloaded from https://github.com/fzenoni/london-accidents.
“He didn’t notice that the lights had changed”
Besides being able to “put dots on a map”, R can be used in very interesting ways for spatial analytics. Last month, Stefano performed some descriptive statistics on a Kaggle dataset that covers nine years of UK traffic accidents. That dataset is so rich that it inspired me in multiple ways.
I will try to answer the following questions: what if the government wanted to highlight the areas of a city showing some alarming characteristics, with a given statistical significance? What if we wanted to discover which of London’s roads or intersections are most dangerous for car drivers, purely by retaining their location and some measure of severity?
Unfortunately, analyzing the entire London area turns out to be too resource-intensive, which is why I will subset an area enclosed within a 4 km radius. Nevertheless, the method shown remains valid at any scale.
Data preparation
First things first: we load the data and clean it a bit. The fastest way to do this in memory, while enjoying functions devoted to tables, is still the data.table package, combined with selecting only the strictly necessary columns.
# Load the required packages
library(data.table)
library(magrittr)

# Load data
set <- list.files(path = 'london-accidents-data',
                  pattern = 'accidents.*csv', full.names = TRUE)
cols_to_keep <- c('Location_Easting_OSGR', 'Location_Northing_OSGR', 'Number_of_Casualties')
data <- lapply(set, fread, select = cols_to_keep) %>% rbindlist()
# Filter out empty locations
data <- na.omit(data, cols = cols_to_keep)
# Remove duplicates
data <- data[!duplicated(data)]
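To make the cleaning steps concrete, here is a minimal, self-contained sketch (with made-up column names) that runs a toy data.table through the same na.omit() and duplicated() calls:

```r
library(data.table)

# Toy table: one row with a missing location, one exact duplicate
dt <- data.table(
  east       = c(1, 2, NA, 1),
  north      = c(5, 6, 7, 5),
  casualties = c(1, 2, 3, 1)
)

# Drop rows with NA in the location columns only
dt <- na.omit(dt, cols = c('east', 'north'))

# Remove exact duplicate rows
dt <- dt[!duplicated(dt)]

nrow(dt)  # 2
```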
data.table is nice and all, but since we’re working with spatial data, we’re going to use the sf format, as we did in the past. As sf does not exactly extend data.table, I’m going to cast the table to a data.frame first. Note the CRS, corresponding to the British Ordnance Survey National Grid.
library(sf)

data <- st_as_sf(data.frame(data),
                 coords = c('Location_Easting_OSGR', 'Location_Northing_OSGR'),
                 crs = 27700)
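To illustrate what EPSG:27700 means in practice, here is a small sketch (independent of the objects above, with a made-up point) that transforms British National Grid coordinates to longitude/latitude:

```r
library(sf)

# A point given in British National Grid coordinates (EPSG:27700)
pt <- st_sfc(st_point(c(530000, 180000)), crs = 27700)

# Transform it to WGS84 longitude/latitude
pt_ll <- st_transform(pt, crs = 4326)
st_coordinates(pt_ll)  # roughly lon -0.13, lat 51.5: central London
```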
As anticipated, these data include accidents over nine years for the whole of the UK. That represents a lot of records, and performance-wise, I don’t necessarily have a strategy in place to analyze them all. As a first move, I’m going to select the events that fall inside the boroughs of London’s administrative boundaries. To help with this task, the Kaggle dataset includes a GeoJSON file of the UK’s districts.
I must confess that until very recently I had mixed feelings about this format. Luckily I changed my mind thanks to the discovery of two methods to open such a file and import it into R as an sf object.
The first one is sf::read_sf().
system.time(sf::read_sf('london-accidents-data/Local_Authority_Districts_Dec_2016.geojson'))
## user system elapsed
## 7.564 1.759 9.781
But the freshest discovery is in fact the geojsonsf::geojson_sf() function from SymbolixAU (check their blog post here), which serves exactly our purpose, even faster than sf’s method.
system.time(map <- geojsonsf::geojson_sf('london-accidents-data/Local_Authority_Districts_Dec_2016.geojson'))
## user system elapsed
## 2.388 0.096 2.955
Let’s quickly inspect the map object.
head(map)
## Simple feature collection with 6 features and 10 fields
## geometry type: GEOMETRY
## dimension: XY
## bbox: xmin: -2.832457 ymin: 53.30503 xmax: -0.7884304 ymax: 54.72717
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
## bng_e bng_n geometry lad16cd
## 1 447157 531476 MULTIPOLYGON (((-1.268456 5... E06000001
## 2 451141 516887 MULTIPOLYGON (((-1.243904 5... E06000002
## 3 464359 519597 MULTIPOLYGON (((-1.137578 5... E06000003
## 4 444937 518183 MULTIPOLYGON (((-1.317286 5... E06000004
## 5 428029 515649 POLYGON ((-1.637678 54.6171... E06000005
## 6 354246 382146 MULTIPOLYGON (((-2.626835 5... E06000006
## lad16nm lad16nmw lat long objectid st_areashape
## 1 Hartlepool 54.67616 -1.27023 1 93559511
## 2 Middlesbrough 54.54467 -1.21099 2 53888581
## 3 Redcar and Cleveland 54.56752 -1.00611 3 244820281
## 4 Stockton-on-Tees 54.55691 -1.30669 4 204962233
## 5 Darlington 54.53535 -1.56835 5 197475689
## 6 Halton 53.33424 -2.68853 6 79084033
## st_lengthshape
## 1 71707.33
## 2 43840.85
## 3 97993.29
## 4 119581.51
## 5 107206.28
## 6 77770.94
Now I would like to extract the data relevant to London’s boroughs, but I am bored by having to inspect the map and learn new geography. I am the laziest member of the team, and as such I refuse to copy-paste 33 names into a list. Therefore, I decided to harvest this Wikipedia page with the rvest package and use the resulting list to filter the regions of interest.
# Harvest the list of London boroughs from Wikipedia
wiki_london <- xml2::read_html('https://en.wikipedia.org/wiki/London_boroughs')
boroughs1 <- wiki_london %>% rvest::html_nodes('ol') %>% .[[1]] %>% rvest::html_text()
boroughs2 <- wiki_london %>% rvest::html_nodes('ol') %>% .[[2]] %>% rvest::html_text()
list1 <- strsplit(boroughs1, "\n") %>% unlist()
list2 <- strsplit(boroughs2, "\n") %>% unlist()
list <- c(list1, list2)
# Special cases to fix
list <- replace(list, list=='City of London (not a London borough)', 'City of London')
list <- replace(list, list=='City of Westminster', 'Westminster')
# Inspect
list
## [1] "City of London" "Westminster"
## [3] "Kensington and Chelsea" "Hammersmith and Fulham"
## [5] "Wandsworth" "Lambeth"
## [7] "Southwark" "Tower Hamlets"
## [9] "Hackney" "Islington"
## [11] "Camden" "Brent"
## [13] "Ealing" "Hounslow"
## [15] "Richmond upon Thames" "Kingston upon Thames"
## [17] "Merton" "Sutton"
## [19] "Croydon" "Bromley"
## [21] "Lewisham" "Greenwich"
## [23] "Bexley" "Havering"
## [25] "Barking and Dagenham" "Redbridge"
## [27] "Newham" "Waltham Forest"
## [29] "Haringey" "Enfield"
## [31] "Barnet" "Harrow"
## [33] "Hillingdon"
# Filter map
london <- map %>% dplyr::filter(lad16nm %in% list)
# Unite the boroughs
london_union <- london %>% sf::st_union()
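st_union() dissolves the 33 borough polygons into a single geometry. A minimal illustration of the same idea on two toy squares:

```r
library(sf)

# Two adjacent unit squares (closed rings)
sq1 <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
sq2 <- st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))
squares <- st_sfc(sq1, sq2)

merged <- st_union(squares)

length(squares)  # 2 separate geometries
length(merged)   # 1 dissolved geometry
st_area(merged)  # total area: 2
```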
The London map is ready to be displayed.
plot(london_union)

As mentioned before, I originally wanted to analyze all of London’s data (if not all of the UK’s), but I quickly realized I would have needed to parallelize a large part of the analysis code. I could achieve that for some parts, but some functions were too complicated to dissect and rewrite, so instead I selected only the data falling within a radius of 4000 m from London’s centroid. For sf aficionados, this last geometrical task is a trivial one.
# Project to British National Grid (http://epsg.io/27700)
data <- data %>% st_transform(crs = 27700)
london_union <- london_union %>% st_transform(crs = 27700)
# Build a circle
center <- st_centroid(london_union)
max_radius <- 4000
london_circle <- st_buffer(center, max_radius)
# Filter data thanks to map
london_data <- data[london_circle,]
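The bracket subsetting above is a spatial filter: when the row index is itself an sf or sfc object, sf keeps the rows whose geometries intersect it. A toy sketch of that equivalence, with made-up points:

```r
library(sf)

# Three points and a circle of radius 2 around the origin
pts_sf <- st_sf(id = 1:3,
                geometry = st_sfc(st_point(c(0, 0)),
                                  st_point(c(1, 1)),
                                  st_point(c(10, 10))))
circle <- st_buffer(st_sfc(st_point(c(0, 0))), dist = 2)

# Bracket subsetting...
inside1 <- pts_sf[circle, ]

# ...is equivalent to filtering on st_intersects()
hits <- lengths(st_intersects(pts_sf, circle)) > 0
inside2 <- pts_sf[hits, ]

inside1$id  # 1 2
inside2$id  # 1 2
```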
This is what we got. I know, I know, it’s a small sample!
# Original amount of data
nrow(data)
## [1] 1281716
# Filtered data
nrow(london_data)
## [1] 18069
# Draw the area
plot(london_union)
plot(london_circle, add = TRUE, col = sf.colors(n = 1, alpha = 0.3))

# Display the accidents as points
library(mapview)
mapview(london_data, map.types = "Esri.WorldImagery")